# hamilton-help
g
Hey All, Is there a way to define a more abstract/generic type hint to enable easy swapping of objects? For example:

DAG:

```python
from pathlib import Path
from typing import List, Union

from langchain_community.document_loaders.pdf import BasePDFLoader
from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: BasePDFLoader) -> List[Document]:
    return pdf_loader(pdf_path).load()
```

Driver:

```python
from langchain_community.document_loaders import PyMuPDFLoader

results = dr.execute(final_vars=outputs, inputs={"pdf_loader": PyMuPDFLoader})
```

PyMuPDFLoader inherits from BasePDFLoader. I'm getting the following error:

Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.

My motivation comes from wanting to check the DAG with different LangChain PDF loaders. Thanks!
t
Hi Gilad! The code behavior you want should work! In the specific snippet you shared, I believe there's an error in the Driver code. This

```python
..., inputs={"pdf_loader": PyMuPDFLoader})
```

should be (it's missing the `()` for object instantiation):

```python
..., inputs={"pdf_loader": PyMuPDFLoader()})
```
g
Hey Thierry šŸ™‚ In this specific case, PyMuPDFLoader needs to be initialized with the pdf_path itself (it's a weird implementation by LangChain), so it actually should be passed in like I did, as the class `PyMuPDFLoader`.
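i.e., the intended usage is roughly this (the path here is hypothetical):

```python
from langchain_community.document_loaders import PyMuPDFLoader

# the path goes to the constructor; .load() then returns the documents
pages = PyMuPDFLoader("some/file.pdf").load()
```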
šŸ‘ 1
t
Here's some more detail on classes and subclasses for type hints, taking scikit-learn as an example. The following code is valid:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns specific
def trained_model() -> LinearRegression:
    return LinearRegression()

# expects generic, which includes the specific
def prediction(trained_model: BaseEstimator):
    return ...
```
While this isn't:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns generic
def trained_model() -> BaseEstimator:
    return LinearRegression()

# expects specific, which may conflict with the generic
def prediction(trained_model: LinearRegression):
    return ...
```
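The same intuition as a runnable check (the class-vs-instance distinction is also what your error message is about):

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

assert issubclass(LinearRegression, BaseEstimator)    # the specific class subclasses the generic
assert isinstance(LinearRegression(), BaseEstimator)  # so an *instance* satisfies the generic hint
assert isinstance(LinearRegression, type)             # the class itself, however, is a type, not an instance
```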
Ok! That's an interesting API šŸ˜… Do you mind sharing the code of the node `pdf_loader`?
g
```python
from langchain_community.document_loaders import PyMuPDFLoader

pdf_loader = PyMuPDFLoader
```

https://api.python.langchain.com/en/latest/_modules/langchain_community/document_loaders/pdf.html#PyMuPDFLoader
t
Ok, but I'm guessing you can't do `inputs={"pdf_loader": PyMuPDFLoader(pdf_path=...)}` because the `pdf_path` is injected in the Hamilton DAG (i.e., it comes from another node)?
g
Hm............. my intuition tells me I'd prefer to have the path in the DAG, but I'm not sure.... šŸ™‚
t
Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.

This error comes from `Driver.execute()`. It tells me that you have a valid DAG with `pdf_loader: langchain_community.document_loaders.pdf.BasePDFLoader`, but there's a mismatch with the `pdf_loader` value passed in inputs.
Would it make sense to have a function/node to create this object?

```python
import abc

from langchain_community.document_loaders.pdf import BasePDFLoader

def pdf_loader(pdf_type: abc.ABCMeta, pdf_loader_config: dict) -> BasePDFLoader:
    return pdf_type(**pdf_loader_config)  # this creates a `BasePDFLoader`

def my_other_func(pdf_loader: BasePDFLoader):  # ... plus whatever other inputs you need
    return ...
```

Driver code:

```python
dr.execute(..., inputs={"pdf_type": PyMuPDFLoader, "pdf_loader_config": {"pdf_path": ...}})
```
If the LangChain API doesn't provide the necessary flexibility to create the object in the Driver code, I would opt for this option. You don't lose visibility IMO, and you can add a docstring explaining what the node does and justifying your design decision (i.e., why it's not in the Driver code).
g
I don't like having `pdf_loader_config`. I would prefer to have `pdf_path` explicitly in the inputs. I'm currently using `Any`, which is not optimal, but works.

```python
from pathlib import Path
from typing import Any, List, Union

from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: Any) -> List[Document]:
    return pdf_loader(pdf_path).load()
```
šŸ‘ 1
or maybe I'll take your implementation and add `pdf_path` and `additional_loader_config` šŸ™‚
šŸ‘ 1
t
That's another valid option! Maybe it's worth documenting (via the type hint or docstring) that `pdf_loader` is a type rather than an object?
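For example, something like this (just a sketch of your `Any` version with a docstring added):

```python
from pathlib import Path
from typing import Any, List, Union

from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path], pdf_loader: Any) -> List[Document]:
    """Return the pages of the PDF at `pdf_path`.

    `pdf_loader` is a loader *class* (e.g. `PyMuPDFLoader`), not an instance;
    it gets instantiated with `pdf_path` inside this node.
    """
    return pdf_loader(pdf_path).load()
```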
šŸ‘ 1
g
Yeah, you're right. Thanks a lot Thierry šŸ™‚
t
Each person has their own approach, but if you have a main config, one approach I like for ML models is to do:

```python
from typing import Optional

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)  # your default config goes here
    if model_config_overrides:
        config.update(**model_config_overrides)

    return config

def base_model(model_config: dict) -> BaseEstimator:
    return LinearRegression(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
```
This has the following benefits:
• the most common config lives directly in my code (`model_config()`) and I don't have to specify a config to use `dr.execute()`
• if I want to change a few values at the same time, I can use `model_config_overrides`. This will be very explicit from `dr.execute()`, and I can inspect the return value of `model_config()` if I need to debug
• for full control or testing, I can use `dr.execute(..., overrides={"model_config": dict()})` to explicitly pass a config and skip potential bugs in `model_config()`

It might be a bit overengineered, but you might find it useful if you do a lot of experimentation! example: https://github.com/zilto/ordinal-forecasting-digital-phenotyping/blob/master/src/tabular_model.py#L164
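Concretely, those three modes look like this from the Driver side (the `fit_intercept`/`positive` values are just hypothetical `LinearRegression` kwargs):

```python
# 1. rely on the defaults baked into model_config()
dr.execute(final_vars=["trained_model"])

# 2. tweak a few values explicitly via the optional input
dr.execute(final_vars=["trained_model"],
           inputs={"model_config_overrides": {"fit_intercept": False}})

# 3. full control / testing: bypass model_config() entirely
dr.execute(final_vars=["trained_model"],
           overrides={"model_config": {"fit_intercept": False, "positive": True}})
```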
g
Thanks! It helps a lot. Since I like things to be configurable and swappable - I would prefer not to "hardcode" the LinearRegression part but to pass it to the DAG as input and define it outside of the DAG so that I can easily test and compare different models. Does that make sense to you?
t
Yup! So that would depend on the level of flexibility.
• If you have a known set of models to include, you can use `@config.when()` (e.g., swap between LinearRegression, SVM, HistGradientBoostingRegressor); see the sketch after the code below.
• If you want even more flexibility / dynamism, you can write
```python
from typing import Optional

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)  # your default config goes here
    if model_config_overrides:
        config.update(**model_config_overrides)

    return config

# you can set a default value if you want
def base_model(model_config: dict, model_type: type = LinearRegression) -> BaseEstimator:
    return model_type(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
```
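(For the first bullet, the `@config.when()` route could look like the sketch below; the `model_type` config key and the SVR alternative are just my example.)

```python
from hamilton.function_modifiers import config
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

@config.when(model_type="linear")
def base_model__linear(model_config: dict) -> BaseEstimator:
    return LinearRegression(**model_config)

@config.when(model_type="svm")
def base_model__svm(model_config: dict) -> BaseEstimator:
    return SVR(**model_config)
```

You then pick the implementation via the config you pass when building the Driver (e.g., `{"model_type": "svm"}`).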
With the second option's flexibility though, you probably want to check that the `model_config` will be valid for the `model_type`
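e.g., a minimal validation sketch of my own, using `inspect`:

```python
import inspect

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def base_model(model_config: dict, model_type: type = LinearRegression) -> BaseEstimator:
    # sanity check: every config key must be an __init__ parameter of model_type
    valid_params = set(inspect.signature(model_type.__init__).parameters) - {"self"}
    unknown = set(model_config) - valid_params
    if unknown:
        raise ValueError(f"Invalid config keys for {model_type.__name__}: {unknown}")
    return model_type(**model_config)
```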
šŸ‘ 1
g
Looks good, thanks! Can you explain this last sentence: "t`ig` will be valid for the `model_type`"?
t
updated the comment, accidentally deleted some text
šŸ‘ 1